
Conversation

@theunb theunb commented Sep 29, 2025

Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@theunb theunb requested a review from a team as a code owner September 29, 2025 17:17
@theunb
Author

theunb commented Sep 29, 2025

The logs of builds are not closed and stay in memory. As a consequence, the org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream$Flush timers keep rescheduling themselves for completed jobs.

Member

@jglick jglick left a comment

This should be called by WorkflowRun.finish. Do you have some way to reproduce the issue you observed?

@jglick jglick requested a review from dwnusbaum September 29, 2025 17:33
@theunb
Author

theunb commented Sep 29, 2025

I hit this locally all the time. We use Pipeline scripts (workflow-cps-plugin) and see a steady increase of 'org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream$Flush' timers, each linked to the build log (console) of a build that has run since startup. We run the latest Jenkins and the latest plugins. Eventually the controller goes OOM and Jenkins restarts.
I can provide job configs and logs if needed.

@jglick
Member

jglick commented Sep 29, 2025

I can only speculate that something in your environment is causing https://github.com/jenkinsci/workflow-job-plugin/blob/c8d7f497b22f41c3acf8d0ea4b7b2c887be959d6/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java#L672 not to be run, or that this patch would not solve the issue either.

@dwnusbaum
Member

@theunb If you can reproduce the issue, we'd be interested to understand what is happening, but on its own the PR doesn't make sense. onProgramEnd adds a FlowEndNode to the program, which (asynchronously) triggers WorkflowRun$GraphL, which calls WorkflowRun.finish, which closes the listener as Jesse linked to earlier.
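For reference, the shape of that mechanism looks roughly like the following simplified listener (an illustration only, not the actual WorkflowRun$GraphL code; the class name is made up):
---- Illustration: a GraphListener reacting to the FlowEndNode ----
// Simplified sketch only; for real builds WorkflowRun$GraphL plays this role and
// ends up calling WorkflowRun.finish, which closes the build's log listener.
import org.jenkinsci.plugins.workflow.flow.GraphListener
import org.jenkinsci.plugins.workflow.graph.FlowEndNode
import org.jenkinsci.plugins.workflow.graph.FlowNode

class LogClosingListener implements GraphListener {
    @Override
    void onNewHead(FlowNode node) {
        if (node instanceof FlowEndNode) {
            // The program has ended; this is the point at which the run is
            // finished and its log streams are closed, so no further
            // DelayBufferedOutputStream$Flush rescheduling should happen.
            println "flow ended at node ${node.id}"
        }
    }
}
---- end ----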

It sounds like you might already have heap dumps, and if so, it would be interesting to know the concrete types of WorkflowRun.listener and any delegate streams for builds experiencing the problem. If you have configuration and logs (e.g. a build and controller support bundle), that might help us figure out what is happening. I would recommend opening a Jira ticket for investigation purposes until we understand what is happening.
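If it helps, something along these lines in the script console could report the concrete listener type for a suspect build (a sketch only; the job name and build number are placeholders, and the field name is taken from WorkflowRun.listener as mentioned above):
---- Sketch: inspect WorkflowRun.listener for one build ----
// Hypothetical script console sketch; 'some-folder/some-job' and 42 are placeholders.
import jenkins.model.Jenkins
import org.jenkinsci.plugins.workflow.job.WorkflowJob
import org.jenkinsci.plugins.workflow.job.WorkflowRun

def job = Jenkins.get().getItemByFullName('some-folder/some-job', WorkflowJob)
def run = job.getBuildByNumber(42)

def f = WorkflowRun.getDeclaredField('listener') // field name per the discussion above
f.setAccessible(true)
def listener = f.get(run)
// For a completed build this would be expected to be null or closed;
// a live listener here would be worth attaching to the Jira ticket.
println "listener: ${listener?.getClass()?.name}"
println "logger:   ${listener?.logger?.getClass()?.name}"
---- end ----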

@theunb
Author

theunb commented Oct 6, 2025

@dwnusbaum, we notice an ever-increasing number of timers. Over two weeks we accumulated 12,643,769,630 (roughly twelve billion) executed timer tasks, which averages out to about 10,000 executions per second, while running about 5000 jobs per weekday with at most 600 in parallel.

I have a script to give an overview of these timers in the queue:
---- Script to get the queue ----
// Unwrap the private 'base' field of the InterceptingExecutorService wrapper
// to reach the underlying ScheduledThreadPoolExecutor and its task queue.
def instance = jenkins.util.Timer.get() // assumption: the Flush tasks run on the shared jenkins.util.Timer
def parentClass = jenkins.util.InterceptingExecutorService
def field = parentClass.getDeclaredField("base")
field.setAccessible(true)
def baseExecutor = field.get(instance)
queue = baseExecutor.getQueue()
---- end ----

The rest of the script is long, but it basically uses reflection to reach each $Flush task, counts them, prints the referenced OutputStream and file reference, and keeps a histogram of the time remaining until each task executes.

The script reports timer tasks that have already completed (i.e. 'callable' and 'runner' are 'null' and 'queue.contains(timertask)' returns 'false'), instances of 'org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream$Flush', and the sum of both.
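A rough sketch of that counting step, continuing from the 'queue' obtained above (the generic field walk is an assumption for illustration; the real script unwraps the task wrappers explicitly):
---- Sketch: count Flush tasks in the queue ----
// Hypothetical continuation of the script above: walk each queued task's fields
// reflectively and count anything that is a DelayBufferedOutputStream$Flush.
// Inaccessible JDK-internal fields (module encapsulation) are simply skipped.
import java.lang.reflect.Modifier

def flushClassName = 'org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream$Flush'
def findFlush
findFlush = { Object o, int depth ->
    if (o == null || depth > 5) return null
    if (o.getClass().name == flushClassName) return o
    for (def c = o.getClass(); c != null && c != Object; c = c.superclass) {
        for (def f in c.declaredFields) {
            if (Modifier.isStatic(f.modifiers) || f.type.primitive) continue
            def value
            try {
                f.accessible = true
                value = f.get(o)
            } catch (Throwable ignored) {
                continue
            }
            def found = findFlush(value, depth + 1)
            if (found != null) return found
        }
    }
    return null
}

int flushCount = 0
queue.each { task -> if (findFlush(task, 0) != null) flushCount++ }
println "queued DelayBufferedOutputStream\$Flush tasks: ${flushCount}"
---- end ----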
Here are some samples taken at 10-second intervals (the script itself also takes a couple of seconds to execute):

  • total: 165565 nullNotInQueue: 74139 outputStreams: 90999 sum: 165138
  • total: 165565 nullNotInQueue: 110482 outputStreams: 54661 sum: 165143
  • total: 165564 nullNotInQueue: 98322 outputStreams: 66820 sum: 165142
  • total: 165572 nullNotInQueue: 149777 outputStreams: 15368 sum: 165145
  • total: 165575 nullNotInQueue: 77326 outputStreams: 87821 sum: 165147
  • total: 165568 nullNotInQueue: 83434 outputStreams: 81713 sum: 165147
  • total: 165571 nullNotInQueue: 148694 outputStreams: 16455 sum: 165149
  • total: 165571 nullNotInQueue: 109282 outputStreams: 55867 sum: 165149
  • total: 165597 nullNotInQueue: 148936 outputStreams: 16241 sum: 165177

I guess the 'nullNotInQueue' number depends on how many 'org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream$Flush' timer tasks have just completed; the sum seems to remain stable.
The problem I see is that we run at most 600 jobs in parallel, yet when I sample the console logs linked to these Flush timers, they all seem to point to completed builds. The Flush timer only holds a weak reference to the OutputStream, so I suspect the OutputStream is not being closed: the Flush timer keeps rescheduling itself until the OutputStream is garbage collected.
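To illustrate the pattern being described, here is a simplified sketch (not the plugin's actual DelayBufferedOutputStream$Flush code; class name and period are made up) of a task that keeps rescheduling itself for as long as its weakly referenced stream has not been garbage collected:
---- Illustration: self-rescheduling flush task ----
// Simplified sketch of the rescheduling pattern described above.
import java.lang.ref.WeakReference
import java.util.concurrent.ScheduledExecutorService
import java.util.concurrent.TimeUnit

class PeriodicFlush implements Runnable {
    private final WeakReference<OutputStream> ref
    private final ScheduledExecutorService timer

    PeriodicFlush(OutputStream out, ScheduledExecutorService timer) {
        this.ref = new WeakReference<OutputStream>(out)
        this.timer = timer
    }

    @Override
    void run() {
        OutputStream out = ref.get()
        if (out == null) {
            return // referent was garbage collected: stop rescheduling
        }
        try {
            out.flush()
        } catch (IOException ignored) {
            // ignored for the illustration
        }
        // Still reachable (e.g. the build's log stream was never released),
        // so the task reschedules itself and the cycle continues.
        timer.schedule(this, 1, TimeUnit.SECONDS)
    }
}
---- end ----
If completed builds were still strongly referencing their log OutputStream somewhere, the corresponding Flush tasks would never see a cleared reference and would keep queueing themselves, which matches the growth shown above.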
